Topic Modelling with Word Embeddings
نویسندگان
چکیده
English. This work aims at evaluating and comparing two different frameworks for the unsupervised topic modelling of the CompWHoB Corpus, namely our political-linguistic dataset. The first approach is represented by the application of the latent DirichLet Allocation (henceforth LDA), defining the evaluation of this model as baseline of comparison. The second framework employs Word2Vec technique to learn the word vector representations to be later used to topic-model our data. Compared to the previously defined LDA baseline, results show that the use of Word2Vec word embeddings significantly improves topic modelling performance but only when an accurate and taskoriented linguistic pre-processing step is carried out. Italiano. L’obiettivo di questo contributo è di valutare e confrontare due differenti framework per l’apprendimento automatico del topic sul CompWHoB Corpus, la nostra risorsa testuale. Dopo aver implementato il modello della latent DirichLet Allocation, abbiamo definito come standard di riferimento la valutazione di questo stesso approccio. Come secondo framework, abbiamo utilizzato il modello Word2Vec per apprendere le rappresentazioni vettoriali dei termini successivamente impiegati come input per la fase di apprendimento automatico del topic. I risulati mostrano che utilizzando i ‘word embeddings’ generati da Word2Vec, le prestazioni del modello aumentano significativamente ma solo se supportati da una accurata fase di ‘pre-processing’ linguisti-
منابع مشابه
Improving Distributed Word Representation and Topic Model by Word-Topic Mixture Model
We propose a Word-Topic Mixture(WTM) model to improve word representation and topic model simultaneously. Firstly, it introduces the initial external word embeddings into the Topical Word Embeddings(TWE) model based on Latent Dirichlet Allocation(LDA) model to learn word embeddings and topic vectors. Then the results learned from TWE are integrated in the LDA by defining the probability distrib...
متن کاملA Correlated Topic Model Using Word Embeddings
Conventional correlated topic models are able to capture correlation structure among latent topics by replacing the Dirichlet prior with the logistic normal distribution. Word embeddings have been proven to be able to capture semantic regularities in language. Therefore, the semantic relatedness and correlations between words can be directly calculated in the word embedding space, for example, ...
متن کاملImproving Twitter Sentiment Classification Using Topic-Enriched Multi-Prototype Word Embeddings
It has been shown that learning distributed word representations is highly useful for Twitter sentiment classification. Most existing models rely on a single distributed representation for each word. This is problematic for sentiment classification because words are often polysemous and each word can contain different sentiment polarities under different topics. We address this issue by learnin...
متن کاملIntegrating Topic Modeling with Word Embeddings by Mixtures of vMFs
Gaussian LDA integrates topic modeling with word embeddings by replacing discrete topic distribution over word types with multivariate Gaussian distribution on the embedding space. This can take semantic information of words into account. However, the Euclidean similarity used in Gaussian topics is not an optimal semantic measure for word embeddings. Acknowledgedly, the cosine similarity better...
متن کاملSub-Word Similarity based Search for Embeddings: Inducing Rare-Word Embeddings for Word Similarity Tasks and Language Modelling
Training good word embeddings requires large amounts of data. Out-of-vocabulary words will still be encountered at test-time, leaving these words without embeddings. To overcome this lack of embeddings for rare words, existing methods leverage morphological features to generate embeddings. While the existing methods use computationally-intensive rule-based (Soricut and Och, 2015) or tool-based ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016